Abstract

Link prediction is one of the fundamental problems in network analysis. In many applications,
notably in genetics, a partially observed network may not contain any negative examples
of absent edges, which creates a difficulty for many existing supervised learning approaches. 
We develop a new method which treats the observed network as a sample of the true network 
with different sampling rates for positive and negative examples. We obtain a relative ranking 
of potential links by their probabilities, utilizing information on node covariates as well as on 
network topology. Empirically, the method performs well under many settings, including when 
the observed network is sparse. We apply the method to a protein-protein interaction network 
and a school friendship network. 



1 Introduction 



A variety of data in many different fields can be described by networks. Examples include friendship 
and social networks, food webs, protein-protein interaction and gene regulatory networks, the World 
Wide Web, and many others. 

One of the fundamental problems in network science is link prediction, where the goal is to predict 
the existence of a link between two nodes based on observed links between other nodes as well 
as additional information about the nodes (node covariates) when available (see [17], [16] and
[7] for recent reviews). Link prediction has wide applications. For example, recommendation of
new friends or connections for members is an important service in online social networks such as 
Facebook. In biological networks, such as protein-protein interaction and gene regulatory networks, 
it is usually time-consuming and expensive to test existence of links by comprehensive experiments; 
link prediction in these biological networks can provide specific targets for future experiments. 



There are two different settings under which the link prediction problem is commonly studied. In 
the first setting, a snapshot of the network at time t, or a sequence of snapshots at times 1, ..., t, is
used to predict new links that are likely to appear in the near future (at time t + 1). In the second 
setting, the network is treated as static but not fully observed, and the task is to fill in the missing 
links in such a partially observed network. These two tasks are related in practice, since a network
evolving over time can also be partially observed and a missing link is more likely to emerge in the
future. From the analysis point of view, however, these settings are quite different; in this paper, 
we focus on the partially observed setting and do not consider networks evolving over time. 

There are several types of methods for the link prediction problem in the literature. The first 
class of methods consists of unsupervised approaches based on various types of node similarities. 
These methods assign a similarity score s(i,j) to each pair of nodes i and j, and higher similarity 
scores are assumed to imply higher probabilities of a link. Similarities can be based either on node 
attributes or solely on the network structure, such as the number of common neighbors; the latter 
are known as structural similarities. Typical choices of structural similarity measures include local 
indices based on common neighbors, such as the Jaccard index [16] or the Adamic-Adar index [1],
and global indices based on the ensemble of all paths, such as the Katz index [13] and the Leicht-
Holme-Newman index [15]. Comprehensive reviews of such similarity measures can be found in
[E] and [22].

Another class of approaches to link prediction includes supervised learning methods that use both 
network structures and node attributes. These methods treat link prediction as a binary classification
problem, where the responses are {1,0} indicating whether there exists a link for a pair,
and the predictors are covariates for each pair, which are constructed from node attributes. A 
number of popular supervised learning methods have been applied to the link prediction problem. 
For example, [2] and [4] use the support vector machine with pairwise kernels, and [8] compares the
performance of several supervised learning methods. Other supervised methods use probabilistic
models for incomplete networks to do link prediction, for example, the hierarchical structure models
[6], latent space models [11], latent variable models pUlIE], and stochastic relational models [21].

Our approach falls in the supervised learning category, in the sense that we make use of both 
the node similarities and observed links. However, one difficulty in treating link prediction as 
a straightforward classification problem is the lack of certainty about the negative and positive 
examples. This is particularly true for negative examples (absent edges). In biological networks 
in particular, there may be no certain negative examples at all [3]. For instance, in a protein-
protein interaction network, an absent edge may not mean that there is no interaction between the 
two proteins - instead, it may indicate that the experiment to test that interaction has not been 
done, or that it did not have enough sensitivity to detect the interaction. Positive examples could 
sometimes also be spurious - for example, high-throughput experiments can yield a large number 
of false positive protein-protein interactions [19]. Here we propose a new link prediction method
that allows for the presence of both false positive and false negative examples. More formally, we 
assume that the network we observe is the true network with independent observation errors, i.e., 
with some true edges missing and other edges recorded erroneously. The error rates for both kinds 
of errors are assumed unknown, and in fact cannot be estimated under this framework. However, we 
can provide rankings of potential links in order of their estimated probabilities, for node pairs with 
observed links as well as for node pairs with no observed links. These relative rankings rather than 
absolute probabilities of edges are sufficient in many applications. For example, pairs of proteins 
without observed interactions that rank highly could be given priority in subsequent experiments. 
To obtain these rankings, we utilize node covariates when available, and/or network topology based 
on observed links. 

The rest of the paper is organized as follows. In Section 2, we specify our (rather minimal) model
assumptions for the network and the edge errors. We propose link ranking criteria for both directed
and undirected networks in Section 3. The algorithms used to optimize these criteria are discussed
in Section 4. In Section 5, we compare the performance of the proposed criteria to other link prediction
methods on simulated networks. In Section 6, we apply our methods to link prediction in a protein-
protein interaction network and a school friendship network. Section 7 concludes with a summary
and discussion of future directions.



2 The network model 



A network with $n$ nodes (vertices) can be represented by an $n \times n$ adjacency matrix $A = [A_{ij}]$,
where

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge from } i \text{ to } j, \\ 0 & \text{otherwise.} \end{cases}$$



We will consider the link prediction problem for both undirected and directed networks. Therefore 
A can be either symmetric (for undirected networks) or asymmetric (for directed networks). 

In our framework, we distinguish between the adjacency matrix of the true underlying network
$A^{True}$, and its observed version $A$. We assume that each $A^{True}_{ij}$ follows a Bernoulli distribution
with $\mathbb{P}(A^{True}_{ij} = 1) = P_{ij}$. Given the true network, we assume that the observed network is
generated by

$$\mathbb{P}(A_{ij} = 1 \mid A^{True}_{ij} = 1) = \alpha, \qquad \mathbb{P}(A_{ij} = 0 \mid A^{True}_{ij} = 0) = \beta,$$

where $\alpha$ and $\beta$ are the probabilities of correctly recording a true edge and an absent edge, respectively.
Note that we assume that these probabilities are constant and do not depend on $i$, $j$, or $P_{ij}$.
Then we have

$$\tilde{P}_{ij} \triangleq \mathbb{P}(A_{ij} = 1) = (\alpha + \beta - 1) P_{ij} + (1 - \beta). \tag{1}$$

If the values of $\alpha$, $\beta$ and $P_{ij}$ were known, then the probabilities of true edges conditional on the
observed adjacency matrix could have been estimated as

$$\mathbb{P}(A^{True}_{ij} = 1 \mid A_{ij} = 1) = \frac{\alpha P_{ij}}{\tilde{P}_{ij}}, \tag{2}$$

$$\mathbb{P}(A^{True}_{ij} = 1 \mid A_{ij} = 0) = \frac{(1 - \alpha) P_{ij}}{1 - \tilde{P}_{ij}}. \tag{3}$$

It is easy to check that both (2) and (3) are monotone increasing functions of $P_{ij}$. Taking (1) into
account implies that they are also increasing functions of $\tilde{P}_{ij}$ as long as $\alpha + \beta > 1$. This gives us
a crucial observation: if the goal is to obtain relative rankings of potential links, it is sufficient to
estimate $\tilde{P}_{ij}$, and it is not necessary to know $\alpha$, $\beta$ and $P_{ij}$.
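The ranking argument can be checked numerically. The following Python sketch is only an illustration: the function names, the list of true probabilities and the values of alpha and beta are assumptions chosen for the example, not quantities from the paper. It maps a few true link probabilities to observed-edge probabilities as in (1) and checks that sorting pairs by the observed probability gives the same order as sorting by either conditional probability, (2) or (3), when alpha + beta > 1.

```python
def observed_prob(p, alpha, beta):
    """Equation (1): probability that an edge is recorded in the observed network."""
    return (alpha + beta - 1.0) * p + (1.0 - beta)


def posterior_given_edge(p, alpha, beta):
    """Equation (2): probability of a true edge given that an edge was observed."""
    return alpha * p / observed_prob(p, alpha, beta)


def posterior_given_no_edge(p, alpha, beta):
    """Equation (3): probability of a true edge given that no edge was observed."""
    return (1.0 - alpha) * p / (1.0 - observed_prob(p, alpha, beta))


def ranking(values):
    """Indices sorted from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda idx: -values[idx])


# Assumed (illustrative) true link probabilities and error rates, alpha + beta > 1.
true_probs = [0.05, 0.6, 0.3, 0.9, 0.15]
alpha, beta = 0.7, 0.95

obs = [observed_prob(p, alpha, beta) for p in true_probs]
post_edge = [posterior_given_edge(p, alpha, beta) for p in true_probs]
post_no_edge = [posterior_given_no_edge(p, alpha, beta) for p in true_probs]

same_order = ranking(obs) == ranking(post_edge) == ranking(post_no_edge)
print(same_order)
```

Only the order of the observed probabilities is used here, which is why the unknown error rates never need to be estimated for ranking purposes.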

An important special case in this setting is $\beta = 1$. Then all the observed links are true positives,
and we only need to provide a ranking for node pairs without observed links. This can be applied
in recommender systems, for example, for recommending possible new friends in a social network.
Another special case is when $\alpha = 1$, which corresponds to all absent edges being true negatives.
This setting can be used to frame the problem of investigating reliability of observed links, for
example, in a gene regulatory network inferred from high-throughput gene expression data. An
estimate of $[\tilde{P}_{ij}]$, the matrix of edge probabilities in the observed network, provides rankings for both
these special cases and the general problem, and thus we focus on estimating $\tilde{P}_{ij}$ for the rest of the paper.



3 Link prediction criteria 

In this section, we propose criteria for estimating the probabilities of edges in the observed network,
$\tilde{P}_{ij}$, for both directed and undirected networks. The criteria rely on a symmetric matrix $W = [W_{ii'}]$
with $0 \le W_{ii'} \le 1$, which describes the similarity between nodes $i$ and $i'$. The similarity matrix
$W$ can be obtained from different sources, including node information, network topology, or a
combination of the two. We will discuss choices of $W$ later in this section.

3.1 Link prediction for directed networks 

First we consider directed networks. The key assumption we make is that if two pairs of nodes are
similar to each other, the probabilities of links within these two pairs are also similar. Specifically,
in Figure 1, $P_{ij}$ and $P_{i'j'}$ are assumed close in value if node $i$ is similar to node $i'$ and node $j$ is
similar to node $j'$. For directed networks, we measure similarity of node pairs $(i,j)$ and $(i',j')$
by the product $W_{ii'}W_{jj'}$ (see Figure 1), which implies two pairs are similar only if both pairs of
endpoints are similar. This assumption should not be confused with a different assumption made
by many unsupervised link prediction methods, which assume that a link is more likely to exist
between similar nodes, applicable to networks with assortative mixing. Assortative networks are
common - a typical example is a social network, where people commonly tend to be friends with
those of similar age, income level, race, etc. However, there are also networks with disassortative
mixing, in which the assumption that similar pairs are more likely to be connected is no longer
valid - for example, predators do not typically feed on each other in a food web. Our assumption,
in contrast, is equally plausible for both assortative and disassortative networks, as well as more
general settings, as it does not assume anything about the relationship between $P_{ij}$ and $W_{ij}$.

Figure 1: Pair similarity for directed networks

Motivated by this assumption of similar probabilities of links for similar node pairs, we propose to
estimate $\tilde{P}_{ij} = \mathbb{E}(A_{ij})$ by

$$\hat{f} = \arg\min_{f} \frac{1}{n^2} \sum_{i,j} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i,j,i',j'} W_{ii'} W_{jj'} (f_{ij} - f_{i'j'})^2, \tag{4}$$

where $f$ is a real-valued $n \times n$ matrix, and $\lambda$ is a tuning parameter. The first term is the usual squared
error loss connecting the parameters with the observed network. The minimizer of its population
version, i.e., $\mathbb{E}(A_{ij} - f_{ij})^2$, is $\tilde{P}_{ij}$. The second term enforces our key assumption, penalizing the
difference between $f_{ij}$ and $f_{i'j'}$ more if the two node pairs $(i,j)$ and $(i',j')$ are similar. The choice of
the squared error loss is not crucial, and other commonly used loss functions could be considered
instead, for example, the hinge loss or the negative log-likelihood. The main reason for choosing
the squared error loss is computational efficiency, since it makes (4) a quadratic problem; see more
details on this in Section 4.
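To make the two terms of criterion (4) concrete, here is a small Python sketch that evaluates the objective for a candidate matrix f by brute force. The function name and the normalizing constants (1/n^2 for the loss and lambda/n^4 for the penalty) reflect one reading of the criterion and should be treated as assumptions; the four nested loops are only practical for very small n and are not how the criterion would be optimized.

```python
def full_sum_criterion(f, A, W, lam):
    """Squared-error loss plus pair-similarity penalty for a directed network.

    f, A, W are n x n lists of lists; W is a symmetric node similarity matrix.
    Normalization by n**2 and n**4 is an assumption of this sketch.
    """
    n = len(A)
    loss = sum((A[i][j] - f[i][j]) ** 2 for i in range(n) for j in range(n)) / n ** 2
    penalty = 0.0
    for i in range(n):
        for ip in range(n):
            if W[i][ip] == 0.0:
                continue  # dissimilar source nodes contribute nothing
            for j in range(n):
                for jp in range(n):
                    penalty += W[i][ip] * W[j][jp] * (f[i][j] - f[ip][jp]) ** 2
    return loss + lam * penalty / n ** 4


# Toy example (illustrative values only).
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
identity = [[1.0 if a == b else 0.0 for b in range(3)] for a in range(3)]
all_ones = [[1.0] * 3 for _ in range(3)]
constant_f = [[0.5] * 3 for _ in range(3)]

# With W equal to the identity no two distinct pairs are similar, so f = A is optimal.
print(full_sum_criterion(A, A, identity, lam=1.0))
# A constant f has zero penalty for any W; only the squared-error loss remains.
print(full_sum_criterion(constant_f, A, all_ones, lam=1.0))
```

The two extreme choices of W show the trade-off controlled by the tuning parameter: the loss pulls f toward the observed network, while the penalty pulls entries for similar pairs toward each other.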

In some applications, we may have additional information about true positive and negative examples,
i.e., some $A_{ij}$'s may be known to be true 1's and true 0's, while others may be uncertain.
This could happen, for example, when validation experiments have been conducted on a subset of
a gene or protein network inferred from expression data. If such information is available, it makes
sense to use it, and we can then modify criterion (4) as follows:

$$\hat{f} = \arg\min_{f} \frac{1}{n^2} \sum_{i,j} E_{ij} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i,j,i',j'} W_{ii'} W_{jj'} (f_{ij} - f_{i'j'})^2, \tag{5}$$

where $E_{ij} = 1$ if it is known that $A_{ij} = A^{True}_{ij}$, and $E_{ij} = 0$ otherwise. This is similar to a semi-supervised
criterion proposed in [13]. However, [13] did not consider the uncertainty in positive and negative
examples, nor did they consider the undirected case which we discuss next. Since (5) only involves
a partial sum of the loss function terms, we will refer to (5) as the partial-sum criterion and (4) as
the full-sum criterion for the rest of the paper.



3.2 Link prediction for undirected networks

Figure 2: Pair similarity for undirected networks

For undirected networks, our key assumption that $P_{ij}$ and $P_{i'j'}$ are close if two pairs $(i,j)$ and
$(i',j')$ are similar needs to take into account that the direction no longer matters; thus the pairs
are similar if either $i$ is similar to $i'$ and $j$ is similar to $j'$, or if $i$ is similar to $j'$ and $j$ is similar to $i'$
(see Figure 2). Thus we need a new pair similarity measure that combines $W_{ii'}W_{jj'}$ and $W_{ij'}W_{ji'}$.
There are multiple options; for example, two natural combinations are

$$S_1 = W_{ii'}W_{jj'} + W_{ij'}W_{ji'} \quad \text{and} \quad S_2 = \max(W_{ii'}W_{jj'}, W_{ij'}W_{ji'}).$$

Empirically, we found that $S_2$ performs better than $S_1$ for a range of real and simulated networks.
The reason for this can be easily illustrated on the stochastic block model. The stochastic block
model is a commonly used model for networks with communities, where the probability of a link
only depends on the community labels of its two endpoints. Specifically, given community labels
$c = \{c_1, \dots, c_n\}$, the $A^{True}_{ij}$'s are independent Bernoulli random variables with

$$P_{ij} = S_{c_i c_j}, \tag{6}$$

where $S = [S_{ab}]$ is a $K \times K$ symmetric matrix, and $K$ is the number of communities in the
network. Suppose we have the best similarity measure we can possibly hope to have based on the
truth, $W_{ij} = I(c_i = c_j)$, where $I$ is the indicator function. In that case, (6) implies $P_{ij} = P_{i'j'}$ if
$\max(W_{ii'}W_{jj'}, W_{ij'}W_{ji'}) = 1$, whereas the sum of the weights would be misleading.
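The stochastic block model argument can be illustrated with a few lines of Python. In this sketch the community labels, the block probability matrix and the helper name `pair_similarities` are all assumptions made up for the example. With the oracle similarity (1 for nodes in the same community, 0 otherwise), the maximum equals 1 exactly when the two pairs have the same link probability, while the sum takes the value 2 for some such pairs and 1 for others.

```python
def pair_similarities(W, i, j, ip, jp):
    """Return (sum, max) of the two endpoint matchings for undirected pairs."""
    direct = W[i][ip] * W[j][jp]    # i matched with i', j matched with j'
    swapped = W[i][jp] * W[j][ip]   # i matched with j', j matched with i'
    return direct + swapped, max(direct, swapped)


# Assumed toy block model: four nodes in community 0, two nodes in community 1.
labels = [0, 0, 0, 0, 1, 1]
S = [[0.8, 0.1],
     [0.1, 0.5]]
n = len(labels)
W = [[1.0 if labels[a] == labels[b] else 0.0 for b in range(n)] for a in range(n)]
P = [[S[labels[a]][labels[b]] for b in range(n)] for a in range(n)]

# Pairs (0,1) and (2,3) lie inside community 0; pairs (0,4) and (1,5) join the communities.
sum_within, max_within = pair_similarities(W, 0, 1, 2, 3)
sum_between, max_between = pair_similarities(W, 0, 4, 1, 5)
print(sum_within, max_within)    # the sum double counts, the max does not
print(sum_between, max_between)

# Whenever the max similarity is 1, the two pairs share the same link probability.
implication_holds = all(
    P[i][j] == P[ip][jp]
    for i in range(n) for j in range(i + 1, n)
    for ip in range(n) for jp in range(ip + 1, n)
    if pair_similarities(W, i, j, ip, jp)[1] == 1.0
)
print(implication_holds)
```

Both pairs of pairs in the example have identical link probabilities, yet the sum weights the within-community comparison twice as heavily as the between-community one, which is the sense in which it is misleading here.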

Using $S_2$ as the measure of pair similarity, we propose estimating $\tilde{P}_{ij}$ for undirected networks by

$$\hat{f} = \arg\min_{f} \frac{1}{n^2} \sum_{i<j} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i<j,\, i'<j'} \max(W_{ii'}W_{jj'}, W_{ij'}W_{ji'}) (f_{ij} - f_{i'j'})^2. \tag{7}$$

Similarly to the directed case, if we have information about true positive and negative examples,
we can use a partial-sum criterion

$$\hat{f} = \arg\min_{f} \frac{1}{n^2} \sum_{i<j} E_{ij} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i<j,\, i'<j'} \max(W_{ii'}W_{jj'}, W_{ij'}W_{ji'}) (f_{ij} - f_{i'j'})^2, \tag{8}$$

where $E_{ij} = 1$ if it is known that $A_{ij} = A^{True}_{ij}$, and $E_{ij} = 0$ otherwise.



3.3 Node similarity measures 



The last component we need to specify is the node similarity matrix $W$. One typical situation is
when we have reasons to believe that the external node covariates are related to the structure of
the network, in which case it is natural to use covariate information to construct $W_{ii'}$. Though
more complicated formats do exist, node covariates are typically represented by an $n \times p$ matrix
$X$ where $X_{ik}$ is the value of variable $k$ on node $i$. Then $W_{ii'}$ can be taken to be some similarity
measure between the $i$-th and $i'$-th rows of $X$. For example, if $X$ contains only numerical variables
and has been standardized, we can use the exponential decay kernel applied to the distance between
the $i$-th and $i'$-th rows of $X$, where $\|\cdot\|$ is the Euclidean vector norm.

When node covariates are not available, node similarity $W_{ii'}$ is usually obtained from the topology
of the observed network $A$, i.e., $W_{ii'}$ is large if $i$ and $i'$ have a similar pattern of connections with
other nodes. For undirected networks, a simple choice of $W_{ii'}$ could be

$$W_{ii'} = \frac{|\{k : A_{ik} = A_{i'k}\}|}{n}, \tag{9}$$

where $|\cdot|$ denotes cardinality of a set. This particular measure turns out to be not very useful:
since most real networks are sparse, most entries of any $k$-th column will be 0, and thus most of the
$W_{ii'}$'s would be large. A more informative measure is the Jaccard index [16],

$$W_{ii'} = \frac{|N(i) \cap N(i')|}{|N(i) \cup N(i')|}, \tag{10}$$

where $N(i) = \{k : A_{ik} = 1\}$ is the set of neighbors of node $i$.
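The difference between the two topology-based measures is easy to see on a small sparse graph. The Python sketch below is illustrative only: the function names, the toy edge list and the convention of returning 0 when both neighborhoods are empty are assumptions, not part of the paper.

```python
def matching_similarity(A, i, ip):
    """Fraction of columns k on which rows i and i' of A agree."""
    n = len(A)
    return sum(1 for k in range(n) if A[i][k] == A[ip][k]) / n


def jaccard_similarity(A, i, ip):
    """Jaccard index of the neighbor sets of nodes i and i'."""
    n = len(A)
    neighbors_i = {k for k in range(n) if A[i][k] == 1}
    neighbors_ip = {k for k in range(n) if A[ip][k] == 1}
    union = neighbors_i | neighbors_ip
    if not union:
        return 0.0  # assumed convention when neither node has any neighbor
    return len(neighbors_i & neighbors_ip) / len(union)


# Toy sparse undirected network on 10 nodes (illustrative edge list).
n = 10
edges = [(0, 1), (2, 3), (4, 1)]
A = [[0] * n for _ in range(n)]
for a, b in edges:
    A[a][b] = A[b][a] = 1

# Nodes 0 and 2 share no neighbors, yet the matching measure is high
# because both rows are almost entirely zero.
print(matching_similarity(A, 0, 2), jaccard_similarity(A, 0, 2))
# Nodes 0 and 4 have exactly the same neighbor (node 1).
print(matching_similarity(A, 0, 4), jaccard_similarity(A, 0, 4))
```

In the example the matching measure assigns 0.8 to two nodes with disjoint neighborhoods, while the Jaccard index separates them cleanly from the pair with a shared neighbor.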

The directed networks case is similar, except we need to count the in and the out links separately.
The formulas corresponding to the matching measure (9) and the Jaccard index (10) become

$$W_{ii'} = \frac{|\{k : A_{ik} = A_{i'k}\}|}{2n} + \frac{|\{k : A_{ki} = A_{ki'}\}|}{2n},$$

$$W_{ii'} = \frac{|N_1(i) \cap N_1(i')|}{2\,|N_1(i) \cup N_1(i')|} + \frac{|N_2(i) \cap N_2(i')|}{2\,|N_2(i) \cup N_2(i')|},$$

where $N_1(i) = \{k : A_{ik} = 1\}$ and $N_2(i) = \{k : A_{ki} = 1\}$.



4 Optimization algorithms 



The proposed link prediction criteria are convex and quadratic in parameters, and thus optimization
is fairly straightforward. The obvious approach is to treat the matrix $f$ as a long vector with $n^2$
elements (or $n(n-1)/2$ in the undirected case), and solve the linear system obtained by taking
the first derivative of any criterion above with respect to this vector. However, solving a system of
linear equations could be challenging for large-scale problems [5]; the number of parameters here is
$O(n^2)$, and so the linear system requires $O(n^4)$ memory. However, if $W$ is sparse, or sparsified by
applying thresholding or some other similar method, then solving the linear system is the efficient
choice.

If the $W$ matrix is not sparse, an iterative algorithm with sequential updates that only requires
$O(n^2)$ memory would be a better choice than solving the linear system. We propose an iterative
algorithm following the idea of block coordinate descent [9], [20]. A block coordinate descent algorithm
partitions the coordinates into blocks and iteratively optimizes the criterion with respect to
each block while holding the other blocks fixed.

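First, we derive the update equations for directed networks. Note that (4) and (5) can be written in the
general form

$$Q = \frac{1}{n^2} \sum_{i,j} V_{ij} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i,j,i',j'} W_{ii'} W_{jj'} (f_{ij} - f_{i'j'})^2, \tag{11}$$

where $V_{ij} = 1$ for (4) and $V_{ij} = E_{ij}$ for (5). For any matrix $M$, let $M_{i\cdot}$ be the $i$th row of $M$. We
treat $f_{i\cdot}$ as a block, and update $f_{i\cdot}$ iteratively. Define $V_i = \mathrm{diag}(V_{i\cdot})$. Then

$$\sum_{j} V_{ij} (A_{ij} - f_{ij})^2 = (f_{i\cdot} - A_{i\cdot}) V_i (f_{i\cdot} - A_{i\cdot})^T. \tag{12}$$

Let $D$ be an $n \times n$ diagonal matrix with $D_{jj} = \sum_{j'} W_{jj'}$. Then

$$\sum_{j,j'} W_{jj'} (f_{ij} - f_{i'j'})^2 = f_{i\cdot} D f_{i\cdot}^T - 2 f_{i\cdot} W f_{i'\cdot}^T + f_{i'\cdot} D f_{i'\cdot}^T. \tag{13}$$

Plugging (12) and (13) into (11), and taking the first derivative of $Q$ with respect to $f_{i\cdot}$, we obtain

$$\frac{\partial Q}{\partial f_{i\cdot}} = \frac{2}{n^2} (f_{i\cdot} - A_{i\cdot}) V_i + \frac{4\lambda}{n^4} \sum_{i'} W_{ii'} (f_{i\cdot} D - f_{i'\cdot} W). \tag{14}$$

Solving $\partial Q / \partial f_{i\cdot} = 0$ with respect to $f_{i\cdot}$, we obtain the updating formula

$$f_{i\cdot}^{(t+1)} = \Big( A_{i\cdot} V_i + \frac{2\lambda}{n^2} \sum_{i'} W_{ii'} f_{i'\cdot}^{(t)} W \Big) \Big( V_i + \frac{2\lambda}{n^2} \sum_{i'} W_{ii'} D \Big)^{-1}, \tag{15}$$

where $f_{i\cdot}^{(t)}$ is the value of $f_{i\cdot}$ at iteration $t$.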
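A row-wise update of this kind can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bcd_directed`, the fixed number of sweeps, the starting value f = A and the scaling constant 2*lambda/n^2 (which comes from one reading of the normalization of the criterion) are all choices made for the example. Each row of f is set to the value that zeroes the derivative of the criterion with respect to that row, with the contribution of the row itself to the weighted average taken from its previous value.

```python
def bcd_directed(A, W, lam, V=None, n_sweeps=200):
    """Row-wise block coordinate descent sketch for the directed criterion.

    A: observed adjacency matrix, W: symmetric node similarities,
    V: optional 0/1 matrix of trusted entries (all ones if omitted).
    """
    n = len(A)
    if V is None:
        V = [[1.0] * n for _ in range(n)]        # V_ij = 1: full-sum criterion
    c = 2.0 * lam / n ** 2                        # assumed scaling of the penalty
    d = [sum(row) for row in W]                   # row sums of W (diagonal of D)
    f = [[float(a) for a in row] for row in A]    # start from the observed network
    for _ in range(n_sweeps):
        for i in range(n):
            # g[j] = sum over i' of W_ii' * (row i' of f times W)_j, current values
            g = [0.0] * n
            for ip in range(n):
                if W[i][ip] == 0.0:
                    continue
                for j in range(n):
                    g[j] += W[i][ip] * sum(f[ip][jp] * W[jp][j] for jp in range(n))
            # The matrix to invert is diagonal, so the update is elementwise.
            f[i] = [(V[i][j] * A[i][j] + c * g[j]) / (V[i][j] + c * d[i] * d[j])
                    for j in range(n)]
    return f


# Toy directed network; nodes 0 and 3 are assumed to be highly similar.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]
W = [[1.0, 0.2, 0.0, 0.9],
     [0.2, 1.0, 0.1, 0.2],
     [0.0, 0.1, 1.0, 0.0],
     [0.9, 0.2, 0.0, 1.0]]
f_hat = bcd_directed(A, W, lam=5.0)
for row in f_hat:
    print([round(value, 3) for value in row])
```

Because every new entry is a weighted average of the observed value and a similarity-weighted mean of current estimates, the estimates stay between 0 and 1 when A is binary. With the partial-sum weights, the denominator stays positive as long as every node has a positive similarity row sum, which holds for instance when the diagonal of W is 1.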



This update is fast to compute but its derivation relies on the product form of $W_{ii'}$ and $W_{jj'}$, and
thus is not directly applicable in the undirected case, where $S_2$ is used as the similarity measure.
However, we can still approximate $S_2$ with a product, using the fact that for $x > 0$, $y > 0$,
$\lim_{q \to \infty} \sqrt[q]{x^q + y^q} = \max(x, y)$. Thus, for sufficiently large $q$, we have

$$[\max(W_{ii'}W_{jj'}, W_{ij'}W_{ji'})]^q \approx (W_{ii'}W_{jj'})^q + (W_{ij'}W_{ji'})^q. \tag{16}$$
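The quality of this approximation for moderate q can be checked directly. The short Python sketch below (the function name `soft_max_q` and the test values are illustrative assumptions) evaluates the q-th root of the sum of q-th powers for increasing q.

```python
def soft_max_q(x, y, q):
    """q-th root of x**q + y**q, which tends to max(x, y) as q grows."""
    return (x ** q + y ** q) ** (1.0 / q)


x, y = 0.6, 0.3
approx = {q: soft_max_q(x, y, q) for q in (1, 2, 10, 50)}
for q, value in approx.items():
    print(q, value)

# Worst case: equal arguments overshoot the maximum by the factor 2 ** (1 / q).
worst_case_q10 = soft_max_q(0.5, 0.5, 10)
print(worst_case_q10)
```

When the two arguments differ clearly, q = 10 is already very close to the maximum; the approximation is loosest when the two products are equal, where it overshoots by a factor of 2^(1/q), about 7% for q = 10.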



Further, $W^q = [W_{ii'}^q]$ is a monotone transformation of $W$ and can also serve as a similarity measure. Based
on (16), we propose to substitute the following approximate criterion for undirected networks,

$$\hat{f} = \arg\min_{f} \frac{1}{n^2} \sum_{i<j} V_{ij} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i<j,\, i'<j'} \big( (W_{ii'}W_{jj'})^q + (W_{ij'}W_{ji'})^q \big) (f_{ij} - f_{i'j'})^2, \tag{17}$$

where $V_{ij} = 1$ for the full sum criterion and $V_{ij} = E_{ij}$ for the partial sum criterion. By symmetry,

$$\sum_{i<j,\, i'<j'} \big( (W_{ii'}W_{jj'})^q + (W_{ij'}W_{ji'})^q \big) (f_{ij} - f_{i'j'})^2 = \frac{1}{2} \sum_{i \ne j,\, i' \ne j'} W_{ii'}^q W_{jj'}^q (f_{ij} - f_{i'j'})^2.$$

This is now in the same form as (11), with each term in the sum containing a product of $W_{ii'}^q$ and
$W_{jj'}^q$, and therefore (17) can be solved by block coordinate descent with an analogous updating
equation as that in the directed network case.

In practice, we found that when W is sparse or truncated to be sparse, solving the linear system 
can be much faster than the block coordinate descent method; however, when W is dense and 
the number of nodes is reasonably large, the block coordinate descent method dominates directly 
solving linear equations. 



5 Simulation studies 



In this section, we test the performance of our link prediction methods on simulated networks. In
all cases, each network consists of $n = 1000$ nodes, and node $i$'s covariates $X_i$ are independently
generated from a multivariate normal distribution $N_p(0, I_p)$ with $p = 5$. Each $A^{True}_{ij}$ is generated
independently, with $\mathrm{logit}\, P_{ij} = f(X_i, X_j)$. We consider the following functions $f(X_i, X_j)$:

(a) $\sum_k (X_{ik} - X_{jk})$, (a') $\sum_k (X_{ik} - X_{jk}) - 8$,
(b) $2 X_i^T X_j / \|X_j\|$, (b') $2 X_i^T X_j / \|X_j\| - 6$,
(c) $\sum_k (X_{ik} + X_{jk})$, (c') $\sum_k (X_{ik} + X_{jk}) - 8$,
(d) $X_i^T X_j$, (d') $X_i^T X_j - 6$.

The right hand column gives sparser versions of functions in the left hand column (subtracting
a constant within the logit link functions lowers the overall degree), which we use to compare
dense and sparse networks (the average degrees of all these networks are reported in Figures 3
and 4). Functions (a) and (b) are asymmetric in $X_i$ and $X_j$, giving directed networks, while (c)
and (d) are symmetric functions corresponding to undirected networks. Further, (a) and (c) are
linear functions; (b) is the projection model proposed in [11], under which the link probability is
determined by the projection of $X_i$ onto the direction of $X_j$, and (d) is an undirected version of
the projection model.

We also generate indicators $E_{ij}$'s as independent Bernoulli variables taking values 1 and 0 with
equal probability, and set $A_{ij} = E_{ij} A^{True}_{ij}$. This setup corresponds to the "partially observed"
network of the title, where all the observed edges are true but the missing edges may or may not
be true 0s.
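The data-generating mechanism for one of the undirected settings can be sketched as follows. This Python example is a hedged illustration: the function name, the use of a symmetric indicator matrix E for the undirected case, the small network size and the seed are assumptions made for the sketch. It draws covariates, generates true edges through a logit link with a linear symmetric function minus an offset, and then hides each pair with probability one half.

```python
import math
import random


def simulate_partially_observed(n, p, offset, rng):
    """Simulate an undirected true network and its partially observed version.

    logit P_ij = sum_k (X_ik + X_jk) - offset; each pair is kept with probability 1/2.
    """
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    A_true = [[0] * n for _ in range(n)]
    A_obs = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            logit = sum(X[i][k] + X[j][k] for k in range(p)) - offset
            prob = 1.0 / (1.0 + math.exp(-logit))
            edge = 1 if rng.random() < prob else 0
            keep = 1 if rng.random() < 0.5 else 0   # indicator E_ij
            A_true[i][j] = A_true[j][i] = edge
            E[i][j] = E[j][i] = keep
            A_obs[i][j] = A_obs[j][i] = edge * keep  # observed only if kept
    return X, A_true, E, A_obs


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
X, A_true, E, A_obs = simulate_partially_observed(n=50, p=5, offset=8.0, rng=rng)
n_true = sum(map(sum, A_true)) // 2
n_obs = sum(map(sum, A_obs)) // 2
print(n_true, n_obs)
```

Every observed edge is a true edge by construction, while an observed zero is either a true absence or a hidden edge, which is exactly the ambiguity the ranking criteria are designed to handle.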



Since we have node covariates affecting the probabilities of links in this case, we define the similarity
matrix $W$ by

$$W_{ii'} = \exp\left( -\frac{\|X_i - X_{i'}\|^2}{\sigma^2} \right),$$

where we choose $\sigma = \mathrm{median}\{\|X_i - X_{i'}\| : i = 1, \dots, n,\ i' = 1, \dots, n\}$. After truncating $W$ at 0.1, we
optimize all criteria by solving linear equations, with $\lambda$ chosen by 5-fold cross validation.
The performance of link prediction is evaluated on the "test" set $\{(i,j) : E_{ij} = 0\}$. We report ROC
curves, which only depend on the rankings of the estimates rather than their numerical values.
Specifically, let $R_{ij}$ be the ranking of $\hat{f}_{ij}$ on the test set in descending order. For any integer $k$,
we define false positives as pairs ranked within top $k$ but without links in the true network
($A^{True}_{ij} = 0$), and true positives as pairs ranked within top $k$ with $A^{True}_{ij} = 1$. Then the true positive
rate (TPR) and the false positive rate (FPR) are defined by

$$\mathrm{TPR}(k) = \frac{|\{(i,j) : E_{ij} = 0,\ R_{ij} \le k,\ A^{True}_{ij} = 1\}|}{|\{(i,j) : E_{ij} = 0,\ A^{True}_{ij} = 1\}|},$$

$$\mathrm{FPR}(k) = \frac{|\{(i,j) : E_{ij} = 0,\ R_{ij} \le k,\ A^{True}_{ij} = 0\}|}{|\{(i,j) : E_{ij} = 0,\ A^{True}_{ij} = 0\}|}.$$

The ROC curves showing the false positive rate vs. the true positive rate over a range of $k$ values
are shown in Figures 3 (directed networks) and 4 (undirected networks). Each curve is the average
of 20 replicates. We also show the ROC curve constructed from the true $P_{ij}$'s as a benchmark.
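The rank-based definitions of the true and false positive rates translate directly into code. The Python sketch below is an illustration under assumptions: the function name `roc_points`, the flat-list input format (one score and one true label per test pair) and the tie-breaking by input order are choices made for the example, and it expects at least one true edge and one true non-edge among the test pairs.

```python
def roc_points(scores, truth):
    """Return the (FPR(k), TPR(k)) points for k = 1, ..., number of test pairs.

    scores: estimated values for the test pairs (those with E_ij = 0);
    truth:  matching 0/1 entries of the true network.
    Ties in the scores are broken by input order (an assumption of this sketch).
    """
    order = sorted(range(len(scores)), key=lambda idx: -scores[idx])
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    true_pos = false_pos = 0
    points = []
    for idx in order:                 # idx is the pair with the next-highest score
        if truth[idx] == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points


# Four test pairs with illustrative scores and true labels.
scores = [0.9, 0.8, 0.3, 0.1]
truth = [1, 0, 1, 0]
points = roc_points(scores, truth)
print(points)
```

Only the ordering of the scores enters the computation, so any monotone transformation of the estimates yields the same curve.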



Figure 3: ROC curves for directed networks; $d$ is the average degree over 20 replicates. Panels: (a) $d = 500$, (a') $d = 13$, (b) $d = 500$, (b') $d = 15$. Horizontal axis: false positive rate.

Figure 4: ROC curves for undirected networks; $d$ is the average degree over 20 replicates. Panels (as recovered): (c) $d = 500$, (c') $d = 13$. Horizontal axis: false positive rate.

Overall, both the full sum and the partial sum criteria perform well. There is little difference
between directed network models and their undirected versions. As expected, the partial sum
criterion always gives better results since it has more information and only uses the true positive
and negative examples for training. But its performance is quite comparable to the completely
unsupervised full sum criterion, except perhaps for model (c). The gaps between the unsupervised
full sum criterion and semi-supervised partial sum criterion become smaller for sparse networks, as 
the false negatives in the full sum are only a small proportion of the large number of true negatives 
in a sparse network. The ROC curve obtained from the true model in sparse networks is better 
than in the corresponding dense networks; this seemingly counter-intuitive finding is also explained 
by the large number of 0s in sparse networks. However, gaps between both our link prediction 
methods and the true model are larger in all the sparse networks than in their dense counterparts. 
This confirms the observation that a small number of positive examples in sparse networks makes 
the link prediction problem challenging. 



6 Applications 

6.1 The protein-protein interaction network 

Our first application is to an undirected network containing yeast protein-protein interactions from
[19]. This network was edited to contain only highly reliable interactions supported by multiple
experiments [3], resulting in 984 protein nodes and 2438 edges, with the average node degree about
5. We take this verified network to be the true underlying network $A^{True}$. [3] also constructed
a matrix measuring similarities between proteins based on gene expression, protein localization,
phylogenetic profiles and yeast two-hybrid data, which we use as the node similarity matrix $W$ for
link prediction.



Figure 5: ROC curves for the protein-protein interaction network. Panels: (a) $\alpha = 0.8$, (b) $\alpha = 0.5$, (c) $\alpha = 0.2$.

Here, we compare the full sum criterion (7), the partial sum criterion (8), and the latent variable
model proposed by [10]. To test prediction, we generate indicators $E_{ij}$'s as independent Bernoulli
variables taking value 1 with probability $\alpha$, and set $A_{ij} = E_{ij} A^{True}_{ij}$. We consider three different
values of $\alpha$, $\alpha = 0.2, 0.5, 0.8$, corresponding to different amounts of available information.

We use the block coordinate descent algorithm proposed in Section 4 to approximately optimize
(7) and (8), with $q = 10$ and $\lambda$ chosen by cross-validation. The latent variable model depends on a
tuning parameter $K$, the dimension of the latent space. We fix $K = 5$ since larger values of $K$ do
not significantly change the performance in this example. We again use ROC curves to evaluate
the link prediction performance on the set $\{(i,j) : E_{ij} = 0\}$. Each ROC curve in Figure 5 is the
average of 10 random realizations of $E_{ij}$'s.

The semi-supervised criterion always performs better than the unsupervised criterion, as it should. 
Further, the semi-supervised criterion almost always outperforms the latent variable model, except 
for very small values of the false positive rate, and the fully unsupervised criterion also starts to 
outperform the latent variable model as the false positive rate increases. The latent variable model 
is also more sensitive to the sampling rate $\alpha$, with performance deteriorating for $\alpha = 0.2$. This
is because the model relies heavily on the structure of the network, and a low sampling rate may 
substantially distort the overall network topology. On the other hand, we use the node similarity 
matrix W which depends only on the features of the proteins, and is thus unaffected by the sampling 
rate. 



6.2 The school friendship network 

This dataset is a school friendship network from the National Longitudinal Study of Adolescent 
Health (see [12] for detailed information). This network contains 1011 high school students and
5459 directed links connecting students to their friends, as reported by the students themselves.
The average degree of this network is also around 5. Here we test our two link prediction criteria,
with the same settings for $E_{ij}$ as in the protein example. Since the latent variable model of [10] is
not applicable to directed networks, we omit it here. Due to lack of node covariates, we construct
a network-based similarity $W$ by using the Jaccard index defined in (10). We again apply block
coordinate descent to minimize the criteria with $\lambda$ chosen by cross-validation, and report the average
ROC curves over 10 realizations of $E_{ij}$'s. As shown in Figure 6, both criteria perform fairly well
for $\alpha = 0.8$ and $\alpha = 0.5$, but fail for $\alpha = 0.2$, as the sampling rate is too small for $W$ to capture the
overall network topology. This does not happen in the protein-protein interaction network, since
$W$ is constructed from covariates on proteins and is unaffected by sub-sampling.

Figure 6: ROC curves for the school friendship network. Panels: (a) $\alpha = 0.8$, (b) $\alpha = 0.5$, (c) $\alpha = 0.2$.



7 Summary and future work 



In this article, we have proposed a new framework for link prediction that allows uncertainty in 
observed links and non-links of a given network. Our method can provide relative rankings of 
potential links for pairs with and without observed links. The proposed link prediction criteria 
are fully non-parametric and essentially model-free, relying only on the assumption that similar 
node pairs have similar link probabilities, which is valid for a wide range of network models. One 
direction we would like to explore in the future is to combine more specific parametric network 
models with our non-parametric approach, with the goal of achieving both robustness and efficiency. 
We are also investigating consistency properties of our method, which is challenging because it 
requires developing a novel theoretical framework for evaluating consistency of rankings. We are 
also developing extensions that would allow the probabilities of errors, $\alpha$ and $\beta$, to depend on the
underlying probabilities of links. This would allow, for example, making highly probable links more 
likely to be observed correctly. Ultimately, we would also like to incorporate the general framework 
of link uncertainty into other network problems, for example, community detection.